Statistically-Guided Word Sense Disambiguation

Authors

  • Elizabeth D. Liddy
  • Woojin Paik
Abstract

Within the field of Natural Language Processing, lexical disambiguation remains one of the toughest hurdles to overcome in the development of fully operational systems. As part of a larger document detection system (DR-LINK), we have implemented a computational approximation of word sense disambiguation by combining information from a machine-readable dictionary, local context, and corpus statistics. We use the Subject-Field Codes (SFCs) extracted from a machine-readable dictionary to produce a preliminary, multi-tagged semantic coding of words in a text. Then we apply local heuristics that evaluate the SFCs of ambiguous words to choose among the multiple SFCs. Choices which cannot be made using local heuristics are resolved by statistical evidence, namely, an SFC correlation matrix that was generated by processing a corpus of 977 Wall Street Journal (WSJ) articles containing 442,059 words. The implementation was tested on a sample of 1638 words from the WSJ and selected the correct SFC 89% of the time. The resulting disambiguated SFC frequencies are summed and normalized to produce a weighted semantic vector representation of each text. These SFC vectors provide the basis on which the system automatically classifies texts as the first stage in DR-LINK.

The Disambiguation Problem

NLP systems take naturally occurring text and create a representation of the meaning of the text that will be used to accomplish the specific task of the system, be it machine translation, document detection, question-answering, knowledge extraction, or information retrieval. Lexical ambiguity has been a major stumbling block in the development of real-world NLP systems for all these applications because a single word may have more than one meaning. According to Gentner (1981), the twenty most frequent nouns in English have an average of 7.3 senses each, while the twenty most frequent verbs have an average of 12.4 senses each.
As a result, when attempting to represent a word which has multiple senses, an NLP system must either produce multiple representations for that word or select one sense from amongst the possible choices included in the system's lexicon. The process of selecting from amongst a word's possible senses is referred to as semantic lexical disambiguation.

Research into human lexical access and disambiguation has been very active in recent years. Small, Cottrell & Tanenhaus (1988) provide a substantive reader on research on lexical disambiguation from the various fields within cognitive science. In principle, we agree with Small, Cottrell & Tanenhaus (1988) that "in order to resolve ambiguity, an NLU (human or otherwise) has to take into account sources of knowledge". Given that no current single theory of the exact nature of, and interaction amongst, these sources can account for all the experimental results in lexical disambiguation, we agree with Prather & Swinney (1988) that there will be "no uniform, invariant solution to lexical ambiguity resolution". Consequently, we interpret the empirical psycholinguistic results as suggesting that there are three sources of influence on the human disambiguation process:

  • Local context: the sentence containing the ambiguous word restricts the interpretation of ambiguous words
  • Domain knowledge: the recognition that a text is concerned with a particular domain activates only the senses appropriate to that domain
  • Frequency data: the frequency of each sense's general usage affects its accessibility

1 Support for this research was provided by DARPA under the auspices of the TIPSTER Project.

From: AAAI Technical Report FS-92-04. Copyright © 1992, AAAI (www.aaai.org). All rights reserved.
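The pipeline outlined in the abstract (multi-tagging words with SFCs, applying a local heuristic, falling back on SFC correlations, then normalizing the chosen codes into a vector) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the toy `LEXICON`, the `CORRELATION` values, the function names, and the particular local heuristic (preferring a code shared with an unambiguous word in the same sentence) are hypothetical and are not taken from DR-LINK or from the dictionary the authors used.

```python
from collections import Counter

# Hypothetical toy lexicon: word -> candidate Subject-Field Codes (SFCs).
LEXICON = {
    "bank": ["ECON", "GEOG"],
    "interest": ["ECON", "PSYCH"],
    "loan": ["ECON"],
    "rate": ["ECON", "MATH"],
    "river": ["GEOG"],
}

# Hypothetical symmetric SFC correlation scores (invented values, standing in
# for a matrix that would be estimated from a corpus).
CORRELATION = {
    frozenset({"ECON", "MATH"}): 0.60,
    frozenset({"ECON", "PSYCH"}): 0.20,
    frozenset({"ECON", "GEOG"}): 0.10,
    frozenset({"GEOG", "MATH"}): 0.25,
    frozenset({"GEOG", "PSYCH"}): 0.10,
    frozenset({"MATH", "PSYCH"}): 0.30,
}


def correlation(a, b):
    """Look up the correlation of two SFCs; a code correlates fully with itself."""
    return 1.0 if a == b else CORRELATION.get(frozenset({a, b}), 0.0)


def disambiguate(words):
    """Choose one SFC per known word in a sentence (unknown words are skipped)."""
    candidates = [LEXICON[w] for w in words if w in LEXICON]
    # Assumed local heuristic: SFCs of unambiguous words anchor the sentence.
    anchors = Counter(codes[0] for codes in candidates if len(codes) == 1)
    # Context for the statistical fallback: anchors if any, else all candidates.
    context = anchors or Counter(c for codes in candidates for c in codes)
    chosen = []
    for codes in candidates:
        supported = [c for c in codes if c in anchors]
        if len(codes) == 1:
            chosen.append(codes[0])
        elif len(supported) == 1:
            chosen.append(supported[0])
        else:
            # Fallback: pick the code most correlated with the sentence context.
            chosen.append(max(
                codes,
                key=lambda c: sum(correlation(c, ctx) * n
                                  for ctx, n in context.items()),
            ))
    return chosen


def sfc_vector(chosen):
    """Sum the chosen SFC frequencies and normalize them to weights summing to 1."""
    total = len(chosen)
    return {code: n / total for code, n in Counter(chosen).items()} if total else {}


print(sfc_vector(disambiguate(["bank", "loan", "interest", "rate"])))
print(sfc_vector(disambiguate(["rate", "river"])))
```

In the first sentence the unambiguous word "loan" settles every ambiguous word by the local heuristic alone; in the second, no candidate code of "rate" matches an anchor, so the correlation scores decide.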

Similar resources

Species Disambiguation for Biomedical Term Identification

An important task in information extraction (IE) from biomedical articles is term identification (TI), which concerns linking entity mentions (e.g., terms denoting proteins) in text to unambiguous identifiers in standard databases (e.g., RefSeq). Previous work on TI has focused on species-specific documents. However, biomedical documents, especially full-length articles, often talk about entiti...

Word Sense Disambiguation of Ambiguous Persian Words with the LDA Topic Model

Word sense disambiguation is the task of identifying the correct sense of a word in a given context from a finite set of possible senses. In this paper, a model for Farsi word sense disambiguation is presented. The model uses two groups of features: first, all words and stop words around the target word, and second, topic models. We extract topics from a Farsi corpus with Latent Dirichlet ...

Combining Classifiers for Word Sense Disambiguation

© 2002 Cambridge University Press. Combining Classifiers for Word Sense Disambiguation. Radu Florian, Silviu Cucerzan, Charles Schafer and David Yarowsky, Department of Computer Science and Center for Language and Speech Processing, Johns Hopkins University, MD 21218, USA. {rflorian,silviu,cschafer,yarowsky}@cs.jhu.edu (Received ) Abstract: Classifier combination is an effective and broadly useful met...

NTCIR-2 Chinese, Cross Language Retrieval Experiments Using PIRCS

We participated in the monolingual Chinese and English-Chinese cross-language retrieval track using our PIRCS retrieval system. Employing the query translation approach for cross-lingual IR, we tried two methods of translation: MT software, and dictionary lookup followed by disambiguation techniques. Retrieval lists from the two methods were combined to form the final result. Pseudorelevance...

Learning Word Sense Embeddings from Word Sense Definitions

Word embeddings play a significant role in many modern NLP systems. Since learning one representation per word is problematic for polysemous words and homonymous words, researchers propose to use one embedding per word sense. Their approaches mainly train word sense embeddings on a corpus. In this paper, we propose to use word sense definitions to learn one embedding per word sense. Experimenta...

Publication date: 2001